95 research outputs found

    GOGGLES: Automatic Image Labeling with Affinity Coding

    Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, the data programming paradigm has been proposed to reduce the human cost in labeling training data. However, data programming relies on designing labeling functions, which still requires significant domain expertise. Also, it is prohibitively difficult to write labeling functions for image datasets, as it is hard to express domain knowledge using raw features for images (pixels). We propose affinity coding, a new domain-agnostic paradigm for automated training data labeling. The core premise of affinity coding is that the affinity scores of instance pairs belonging to the same class should, on average, be higher than those of pairs belonging to different classes, according to some affinity functions. We build the GOGGLES system that implements affinity coding for labeling image datasets by designing a novel set of reusable affinity functions for images, and propose a novel hierarchical generative model for class inference using a small development set. We compare GOGGLES with existing data programming systems on 5 image labeling tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a minimum of 71% to a maximum of 98% without requiring any extensive human annotation. In terms of end-to-end performance, GOGGLES outperforms the state-of-the-art data programming system Snuba by 21% and a state-of-the-art few-shot learning technique by 5%, and is only 7% away from the fully supervised upper bound. Comment: Published at 2020 ACM SIGMOD International Conference on Management of Dat
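
The core premise stated in this abstract can be illustrated with a minimal sketch. It is not the GOGGLES implementation: the cosine-similarity affinity function, the helper names `cosine_affinity` and `mean_affinities`, and the toy feature vectors are all assumptions made up for illustration. The sketch only checks that same-class pairs score higher on average than different-class pairs.

```python
import math
from itertools import combinations


def cosine_affinity(a, b):
    """Hypothetical affinity function: cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_affinities(features, labels, affinity=cosine_affinity):
    """Return (mean same-class affinity, mean different-class affinity).

    Assumes the data contains at least one same-class pair and at least
    one different-class pair.
    """
    same, diff = [], []
    for i, j in combinations(range(len(features)), 2):
        score = affinity(features[i], features[j])
        (same if labels[i] == labels[j] else diff).append(score)
    return sum(same) / len(same), sum(diff) / len(diff)


# Toy data, invented for this example: two instances per class.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]

same_mean, diff_mean = mean_affinities(features, labels)
print(f"same-class mean: {same_mean:.3f}, different-class mean: {diff_mean:.3f}")
```

In a labeling setting the class labels are unknown, so a gap like this would have to be exploited by an inference step rather than measured directly; the abstract attributes that step to a hierarchical generative model, which this sketch does not attempt.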

    Rethinking Similarity Search: Embracing Smarter Mechanisms over Smarter Data

    In this vision paper, we propose a shift in perspective for improving the effectiveness of similarity search. Rather than focusing solely on enhancing data quality, particularly that of machine learning-generated embeddings, we advocate for a more comprehensive approach that also enhances the underpinning search mechanisms. We highlight three novel avenues that call for a redefinition of the similarity search problem: exploiting implicit data structures and distributions, engaging users in an iterative feedback loop, and moving beyond a single query vector. These novel pathways have gained relevance in emerging applications such as large-scale language models, video clip retrieval, and data labeling. We discuss the corresponding research challenges posed by these new problem areas and share insights from our preliminary discoveries.

    Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions

    Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) -- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed -- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques in terms of classification accuracy with mild manual cleaning effort.
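
The two CP queries can be made concrete with a brute-force sketch for a 1-nearest-neighbour classifier. This is not the efficient algorithm the abstract refers to: it enumerates every possible world explicitly, so it is exponential in the number of dirty rows and only serves to illustrate the definitions. The function names, the candidate-repair representation, and the toy data are assumptions introduced here.

```python
from collections import Counter
from itertools import product


def nn_predict(train, query):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    _, label = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    return label


def certain_prediction(incomplete_train, query):
    """Enumerate every possible world and tally the 1-NN prediction in each.

    incomplete_train: list of (candidates, label) pairs, where candidates is
    the list of possible feature vectors for that row (a single entry when
    the row is clean).
    Returns (is_certain, counts): Q1 is answered by is_certain, Q2 by counts.
    """
    candidate_lists = [cands for cands, _ in incomplete_train]
    labels = [label for _, label in incomplete_train]
    counts = Counter()
    for world in product(*candidate_lists):
        counts[nn_predict(list(zip(world, labels)), query)] += 1
    return len(counts) == 1, counts


# Toy data, invented for this example: the third row is dirty and has two
# candidate repairs, so there are two possible worlds.
train = [
    ([(0.0, 0.0)], "a"),
    ([(5.0, 5.0)], "b"),
    ([(1.0, 1.0), (4.0, 4.0)], "a"),
]

certain_near, counts_near = certain_prediction(train, (0.5, 0.5))
certain_far, counts_far = certain_prediction(train, (4.2, 4.2))
print(certain_near, dict(counts_near))
print(certain_far, dict(counts_far))
```

The first query is predicted "a" in both worlds, so it can be CP'ed; the second flips between "a" and "b" depending on how the dirty row is repaired, so it cannot.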

    Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests

    We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event for students to engage in solving exciting data management problems. During this period we had the opportunity to introduce participants to the entity resolution task, which is of paramount importance in the data integration community. We aim to share the executive decisions made by the people co-authoring this report, and the lessons learned.

    FAST discovery of a fast neutral hydrogen outflow

    In this letter, we report the discovery of a fast neutral hydrogen outflow in SDSS J145239.38+062738.0, a merging radio galaxy containing an optical type I active galactic nucleus (AGN). This discovery was made through observations conducted by the Five-hundred-meter Aperture Spherical radio Telescope (FAST) using redshifted 21-cm absorption. The outflow exhibits a blueshifted velocity likely up to $\sim -1000\,\rm km\,s^{-1}$ with respect to the systemic velocity of the host galaxy, with an absorption strength of $\sim -0.6\,\rm mJy\,beam^{-1}$ corresponding to an optical depth of 0.002 at $v=-500\,\rm km\,s^{-1}$. The mass outflow rate ranges between $2.8\times10^{-2}$ and $3.6\,\rm M_\odot\,yr^{-1}$, implying an energy outflow rate ranging between $4.2\times10^{39}$ and $9.7\times10^{40}\,\rm erg\,s^{-1}$, assuming $100\,{\rm K} < T_{\rm s} < 1000\,{\rm K}$. Plausible drivers of the outflow include starbursts, the AGN radiation, and the radio jet, the last of which is considered the most likely culprit according to the kinematics. By analysing the properties of the outflow, the AGN, and the jet, we find that if the HI outflow is driven by the AGN radiation, the AGN radiation seems not powerful enough to provide negative feedback, whereas the radio jet shows the potential to provide negative feedback. Our observations contribute another example of a fast outflow detected in neutral hydrogen, as well as demonstrate the capability of FAST in detecting such outflows. Comment: Accepted by ApJ

    Does a radio jet drive the massive multi-phase outflow in the ultra-luminous infrared galaxy IRAS 10565+2448?

    We present new upgraded Giant Metrewave Radio Telescope (uGMRT) HI 21-cm observations of the ultra-luminous infrared galaxy IRAS 10565+2448, previously reported to show blueshifted, broad, and shallow HI absorption indicating an outflow. Our higher spatial resolution observations have localised this blueshifted outflow, which is $\sim 1.36$ kpc southwest of the radio centre and has a blueshifted velocity of $\sim 148\,\rm km\,s^{-1}$ and a full width at half maximum (FWHM) of $\sim 581\,\rm km\,s^{-1}$. The spatial extent and kinematic properties of the HI outflow are consistent with the previously detected cold molecular outflows in IRAS 10565+2448, suggesting that they likely have the same driving mechanism and are tracing the same outflow. By combining the multi-phase gas observations, we estimate a total outflowing mass rate of at least $140\,\rm M_\odot\,yr^{-1}$ and a total energy loss rate of at least $8.9\times10^{42}\,\rm erg\,s^{-1}$, where the contribution from the ionised outflow is negligible, emphasising the importance of including both cold neutral and molecular gas when quantifying the impact of outflows. We present evidence of the presence of a radio jet and argue that this may play a role in driving the observed outflows. The modest radio luminosity $L_{\rm 1.4GHz} \sim 1.3\times10^{23}\,{\rm W\,Hz^{-1}}$ of the jet in IRAS 10565+2448 implies that the jet contribution to driving outflows should not be ignored in low radio luminosity AGN. Comment: 12 pages, 9 figures, accepted for publication in MNRA

    Theoretical Investigations into Self-Organized Ordered Metallic Semi-Clusters Arrays on Metallic Substrate

    Using energy minimization calculations based on an interfacial potential and on a first-principles total energy method, respectively, we show that the (2 × 2)/(3 × 3) Pb/Cu(111) system is a stable structure among all the [(n − 1) × (n − 1)]/(n × n) Pb/Cu(111) (n = 2, 3,…, 12) structures. The electronic structure calculations indicate that self-organized ordered arrays of Pb semi-clusters are formed on the first Pb monolayer of (2 × 2)/(3 × 3) Pb/Cu(111), which is due to a strain-release effect induced by the inherent misfits. The Pb semi-cluster structure can selectively adsorb atoms of semiconductor materials (e.g., Ge) around the semi-clusters and can therefore be used as a template for the growth of nanoscale structures with a very short periodic length (7.67 Å).